The ATP and WTA Tours are, at their core, world tours. And yet, for all the tennis visualizations out there, I've seldom seen a visualization of players moving en masse across the planet. Here I display two different ways to visualize the movement of players between tournaments: a Sankey diagram, and a good ol' fashioned map of the world. These visualizations are for the 2017 ATP World Tour.
While neither visualization is perfect, they play complementary roles. The Sankey diagram does an excellent job of showing the temporal sequence of tournaments and the quantities of players moving between them. The one thing it doesn't really capture is geography; much of the overall tournament schedule and individual players' calendars is based on distance and ease of travel between tournaments. The world map is a less interesting stand-alone figure, but it is a useful complement to the Sankey diagram.
Part 1: Data Wrangling
I use two data sources to create these visualizations:
The first is match result data from Jeff Sackmann, creator of Tennis Abstract, the Match Charting Project, and the blog Heavy Topspin. I use this data to determine the tournaments played by each player. Once I've created each individual's tournament schedule, I can simply aggregate them to create groups of players traveling between tournaments.
The second is a free database of cities and their latitudes and longitudes from SimpleMaps.com. I use the latitude and longitude information to plot the cities and travel paths of players on a world map.
I've shown below how I create the input data, but if you're not interested in the intricacies of data wrangling, feel free to skip straight to Part 2.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# modules for Sankey diagram
import plotly as py
py.offline.init_notebook_mode(connected=True)
import plotly.io as pio
from IPython.display import Image
#modules for World Map
import cartopy
import cartopy.crs as ccrs
path = '/Users/admin/Documents/Personal_Data_Project/2_Tennis/0_Raw_Input_Data/3_match_results/tennis_atp_master'
tour_matches = pd.read_csv(path + '/atp_matches_2017.csv')
print('There are '
+ str(len(tour_matches))
+ ' matches present, with a total of '
+ str(len(tour_matches.columns))
+ ' columns. The earliest date is '
+ str(tour_matches['tourney_date'].min())
+ ' and the last date is '
+ str(tour_matches['tourney_date'].max())
+ '.'
)
tour_matches.columns
I stick to just 10 columns from the tour_matches dataframe, which can be classified into three groups.
tourney_columns = ['tourney_name', 'surface', 'tourney_level', 'tourney_date']
winner_columns = ['winner_id', 'winner_name', 'winner_rank']
loser_columns = ['loser_id', 'loser_name', 'loser_rank']
tour_matches.loc[:, tourney_columns + winner_columns + loser_columns].head()
Below I show the dataframe containing city locations. Many cities share the same name, so I keep only the most populous city for each name. Fortunately, the major tennis tournaments are hosted in the most populous city of each name.
path = '/Users/admin/Documents/Personal_Data_Project/2_Tennis/0_Raw_Input_Data/4_geography'
cities = pd.read_csv(path + '/worldcities.csv')
cities = (cities
.sort_values(by=['population'])
.drop_duplicates(subset='city_ascii', keep='last')
)
cities.head()
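The deduplication step is easy to check on a toy table. The sketch below uses made-up rows (the population figures are illustrative placeholders, not values from the SimpleMaps file) and assumes the same `city_ascii` and `population` column names:

```python
import pandas as pd

# Toy rows; the population figures are illustrative placeholders only.
toy_cities = pd.DataFrame({
    'city_ascii': ['London', 'London', 'Paris', 'Paris'],
    'country': ['United Kingdom', 'Canada', 'France', 'United States'],
    'population': [9_000_000, 400_000, 11_000_000, 25_000],
})

# Sort ascending by population, then keep the last (largest) row per name.
largest = (toy_cities
           .sort_values(by=['population'])
           .drop_duplicates(subset='city_ascii', keep='last')
           .set_index('city_ascii'))

print(largest['country'].to_dict())
```

Because the sort is ascending, `keep='last'` is what selects the largest city; sorting descending with `keep='first'` would be an equivalent choice.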
Constructing Player Schedules
As we saw with the tour matches dataframe, tournament attendance information for each player is split into the winner and loser columns. Here I split the dataframe into the winner and loser sides and then concatenate the two. That way I have all matches for each player on a different row, regardless of the outcome of the match, and I don't miss a tournament where someone lost in the first round.
player_columns = ['player_id', 'player_name', 'player_rank']
player_schedules = pd.concat([
(tour_matches.loc[:, tourney_columns + winner_columns]
.rename(columns=dict(zip(winner_columns, player_columns)))),
(tour_matches.loc[:, tourney_columns + loser_columns]
.rename(columns=dict(zip(loser_columns, player_columns))))
])
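To see what this reshaping does, here is a minimal sketch on a single made-up match. The player names are placeholders, and `ignore_index=True` is my addition to give the toy result a clean index; the cell above does not use it.

```python
import pandas as pd

# One toy match: the winner and loser sit in separate columns.
toy_matches = pd.DataFrame({
    'tourney_name': ['Doha'],
    'winner_name': ['Player A'],
    'loser_name': ['Player B'],
})

# Rename each side to a shared 'player_name' column, then stack the halves.
toy_long = pd.concat([
    toy_matches[['tourney_name', 'winner_name']]
        .rename(columns={'winner_name': 'player_name'}),
    toy_matches[['tourney_name', 'loser_name']]
        .rename(columns={'loser_name': 'player_name'}),
], ignore_index=True)

print(toy_long)
```

One match becomes two rows, one per participant, so a first-round loser still shows up as having attended the tournament.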
Next, I need to prepare the player_schedules dataframe for merging with city location data. There are a few broad categories of tasks here:
- I eliminate Davis Cup matches from the dataset. This is mostly for convenience and to keep the number of events reasonable for display.
- The spelling of some tournaments doesn't match the spelling of the city in the location data, or it doesn't look good for display, so I correct those names.
- Some tournament names are either not locations or not a city name in the location data. I have identified the nearest city in the location data for these tournaments on an ad hoc basis. I'm not an expert on geography, so there are likely more precise substitutes available.
#drop davis cup from schedules
player_schedules = player_schedules.loc[
~player_schedules.tourney_name.str.contains('Davis Cup'), :]
#cleaning event/city names
tourney_name_cleaning_dict = {"'S-Hertogenbosch" : "'s-Hertogenbosch",
'Marrakech' : 'Marrakesh',
'Rio De Janeiro' : 'Rio de Janeiro',
'Us Open' : 'US Open',
'Canada Masters' : 'Montreal',
'Beijing ' : 'Beijing'
}
player_schedules.tourney_name = player_schedules.tourney_name.replace(
tourney_name_cleaning_dict)
player_schedules['location'] = player_schedules.tourney_name
player_schedules.loc[(player_schedules.tourney_name == 'London')
& (player_schedules.surface == 'Grass'),
'tourney_name'] = "Queen's Club"
player_schedules['location'] = player_schedules['location'].str.replace(' Masters', '')
location_cleaning_dict = {'Australian Open' : 'Melbourne',
'Roland Garros' : 'Paris',
'Wimbledon' : 'London',
'US Open' : 'Queens',
'Antwerp' : 'Antwerpen',
'Monte Carlo' : 'Monaco',
'Estoril' : 'Lisbon',
'Halle' : 'Bielefeld',
'Eastbourne' : 'Brighton',
'Bastad' : 'Halmstad',
"Queen's Club" : 'London',
'Umag' : 'Trieste',
'Gstaad' : 'Sion',
'Kitzbuhel' : 'Innsbruck',
'Los Cabos' : 'Cabo San Lucas'
}
player_schedules.location = player_schedules.location.replace(location_cleaning_dict)
player_schedules.loc[player_schedules.tourney_name != player_schedules.location].head()
Now I merge the player schedules with city locations. I never actually use the country variable, but it is helpful to keep around for sense-checking the data and ensuring I selected the proper city.
player_schedules = player_schedules.merge(
cities.loc[:, ['city_ascii', 'lat', 'lng', 'country',]],
left_on='location',
right_on='city_ascii',
how='left').drop(columns=['city_ascii'])
player_schedules.head()
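Because this is a left merge, any location that fails to match a city silently ends up with missing coordinates. One way to catch that, sketched here on toy data, is the `indicator=True` option of `merge`. The place name 'Atlantis' is deliberately fictional and the coordinates are approximate.

```python
import pandas as pd

toy_schedule = pd.DataFrame({'location': ['Melbourne', 'Paris', 'Atlantis']})
toy_cities = pd.DataFrame({
    'city_ascii': ['Melbourne', 'Paris'],
    'lat': [-37.81, 48.86],   # approximate coordinates
    'lng': [144.96, 2.35],
})

merged = toy_schedule.merge(
    toy_cities, left_on='location', right_on='city_ascii',
    how='left', indicator=True)

# Rows flagged 'left_only' found no city and will have NaN coordinates.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'location'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would suggest that every tournament was mapped to a city.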
I'm going to construct a dataframe of transitions between events. Each row will represent one or more players playing two tournaments consecutively. The Sankey diagram in particular will look better with a node for the start and end of each player's season. Below I insert a row for the start and end of the season for each player.
player_starts = player_schedules.loc[:, player_columns[0:-1]].drop_duplicates()
player_starts['tourney_name'] = 'Start of Season'
player_starts['tourney_date'] = 20170101
player_ends = player_schedules.loc[:, player_columns[0:-1]].drop_duplicates()
player_ends['tourney_name'] = 'End of Season'
player_ends['tourney_date'] = 20171231
player_schedules = pd.concat([player_starts, player_ends, player_schedules], sort=False)
player_schedules = (player_schedules
.sort_values(by=['player_id', 'tourney_date'])
.drop_duplicates()
.reset_index(drop=True))
player_schedules.head()
Lastly, I use the shift function to create a group of columns for the next tournament each player attends. Now each row represents a pair of tournaments attended by the player, and each tournament appears twice (once as the "current" tournament and once as the next tournament).
player_nexts = (player_schedules
.groupby(['player_id', 'player_name'])
[['tourney_name', 'location', 'lat', 'lng', 'country']].shift(-1)
)
player_nexts.columns = 'next_' + player_nexts.columns
player_schedules = player_schedules.join(player_nexts)
player_schedules.head()
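The grouped shift is the key step here, so a toy version may help. This sketch uses placeholder player names and a single column; the real cell shifts several columns at once.

```python
import pandas as pd

toy = pd.DataFrame({
    'player_name': ['A', 'A', 'A', 'B', 'B'],
    'tourney_name': ['Doha', 'Melbourne', 'Rotterdam', 'Doha', 'Sydney'],
})

# shift(-1) inside each group pulls the following row's value up one row,
# so the last row of every player has no "next" tournament (NaN).
toy['next_tourney_name'] = (toy
                            .groupby('player_name')['tourney_name']
                            .shift(-1))
print(toy)
```

Grouping first matters: an ungrouped shift would make player A's last tournament point at player B's first one.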
Constructing Transitions Between Tournaments
Below I define three functions to create final inputs for my data visualizations. The Sankey diagram will require two dataframes: a dataframe of tournaments for the nodes and a dataframe of tournament transitions for the links. The transitions dataframe is different from the player schedules dataframe because it has just one row for each pair of tournaments (the number of players traveling from Melbourne to Montpellier is represented by a single row).
Tournaments and transitions between them will be color coded by surface and event level (Slams, Masters 1000s, and 250s/500s), and those colors are assigned here as well.
def assign_tournament_color(df):
    df['color'] = df.color(df.color_index)
    return df
def construct_tournament_list(df):
    _df_out = df.loc[:, tourney_columns]
    _df_out = (_df_out
               .drop_duplicates()
               .sort_values(by=['tourney_date', 'tourney_name'])
               .reset_index(drop=True)
               )
    # one colormap per surface; grey for the Start/End of Season nodes
    cmap_dict = {'Hard' : plt.get_cmap('Blues'),
                 'Clay' : plt.get_cmap('Reds'),
                 'Grass': plt.get_cmap('Greens'),
                 np.nan : plt.get_cmap('Greys')
                 }
    # position within the colormap: higher values are darker
    level_dict = {'G' : 0.9,
                  'M' : 0.6,
                  'A' : 0.3,
                  np.nan : 0.5
                  }
    _df_out['color'] = _df_out.surface.map(cmap_dict)
    _df_out['color_index'] = _df_out.tourney_level.map(level_dict)
    _df_out = _df_out.apply(assign_tournament_color, axis=1)
    _df_out['color'] = _df_out.color.str[0:3] + (tournament_alpha,)
    return _df_out.drop(columns=['color_index'])
def construct_transitions(df):
    df_tourneys = construct_tournament_list(df)
    id_dict = dict(zip(df_tourneys.tourney_name, df_tourneys.index))
    color_dict = dict(zip(df_tourneys.tourney_name, df_tourneys.color))
    df_trans = (df
                .fillna('Unknown')
                .groupby(['tourney_date', 'tourney_name',
                          'location', 'lat', 'lng', 'country',
                          'next_tourney_name', 'next_location',
                          'next_lat', 'next_lng', 'next_country'
                          ])
                .size()
                .reset_index()
                .replace(to_replace={'Unknown' : np.nan})
                .rename(columns={0 : 'num_players'})
                )
    df_trans['origin_color'] = (df_trans.tourney_name.map(color_dict).str[0:3]
                                + (transition_alpha,))
    df_trans_ids = df_trans.replace(id_dict)
    return df_tourneys, df_trans, df_trans_ids
tournament_alpha = 1
transition_alpha = 0.5
peak_ranks = player_schedules.groupby('player_name')['player_rank'].min()
player_list = peak_ranks[peak_ranks <= 20].index
selection = player_schedules.loc[player_schedules.player_name.isin(player_list), :]
tournaments, transitions, transition_ids = construct_transitions(selection)
tournaments.head()
transitions.head()
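A note on why `construct_transitions` fills missing values with 'Unknown' before grouping: by default, pandas drops any group whose key contains NaN, which would discard the Start of Season and End of Season rows (they have no coordinates). The sketch below shows the default behaviour on toy data, next to the `dropna=False` option available in pandas 1.1 and later. That option is an alternative I am assuming would serve the same purpose, not what the function above uses.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'tourney_name': ['Start of Season', 'Start of Season', 'Doha'],
    'lat': [np.nan, np.nan, 25.29],   # approximate; NaN for the sentinel rows
    'next_tourney_name': ['Doha', 'Doha', 'Melbourne'],
})
keys = ['tourney_name', 'lat', 'next_tourney_name']

# Default behaviour: groups whose key contains NaN are silently dropped.
default_counts = toy.groupby(keys).size()

# With dropna=False the NaN-keyed group is kept.
kept_counts = toy.groupby(keys, dropna=False).size()

print(int(default_counts.sum()), int(kept_counts.sum()))
```

Either approach keeps the two Start of Season rows in the count; the sentinel approach also works on older pandas versions.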
def create_tournaments_sankey(
        tournaments, transition_ids,
        pad=500, title='ATP World Tour'):
    data = dict(
        type='sankey',
        orientation='v',
        node=dict(
            pad=pad,
            thickness=33,
            line=dict(
                color="black",
                width=0.5
            ),
            label=tournaments.tourney_name,
            color='rgba' + tournaments.color.astype(str)
        ),
        link=dict(
            source=transition_ids.tourney_name,
            target=transition_ids.next_tourney_name,
            value=transition_ids.num_players,
            color='rgba' + transition_ids.origin_color.astype(str)
        ))
    layout = dict(
        title=title,
        height=3500,
        width=2000,
        font=dict(
            size=20
        )
    )
    fig = dict(data=[data], layout=layout)
    img_bytes = pio.to_image(fig, format='png')
    return Image(img_bytes)
I've chosen to start by looking at player travel for all players who attained a peak rank of 20 or better in 2017. This is a population of more than 20 players because players can enter and leave the top 20 every week when the rankings are updated.
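A toy example makes the "more than 20 players" point concrete. Here three made-up players swap places over two weeks, and all three spend at least one week in the top 2:

```python
import pandas as pd

# Toy rankings over two weeks; names and ranks are made up for illustration.
toy = pd.DataFrame({
    'player_name': ['A', 'A', 'B', 'B', 'C', 'C'],
    'player_rank': [1, 1, 2, 3, 3, 2],
})

# A lower number is a better rank, so the peak rank is the minimum.
peak = toy.groupby('player_name')['player_rank'].min()

# "Top 2 at some point" captures three players, not two.
top2 = peak[peak <= 2].index.tolist()
print(top2)
```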
Hard court events are shown in blue, clay court events are shown in red, and grass court events are shown in green. Slams are darkest, Masters 1000 events have intermediate darkness, and other events (250s and 500s) are lightest. There are only two green colors because there isn't a Masters event played on grass.
create_tournaments_sankey(tournaments,
transition_ids,
pad=200,
title='2017 ATP Player Travel, Peak Rank in Top 20')
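The Sankey function builds its color strings by prefixing 'rgba' to a tuple of matplotlib floats in the 0 to 1 range. CSS-style `rgba()` strings conventionally use 0 to 255 for the red, green, and blue channels, so if the rendered colors look darker than expected, a conversion along these lines may help. The helper name `to_css_rgba` is hypothetical and not part of the notebook, and how Plotly treats fractional channel values is an assumption worth checking.

```python
def to_css_rgba(color, alpha=None):
    """Convert an (r, g, b[, a]) tuple of 0-1 floats to a CSS rgba() string.

    Hypothetical helper, not part of the original notebook.
    """
    r, g, b = (round(255 * channel) for channel in color[:3])
    # Use the tuple's own alpha unless one is passed explicitly.
    if alpha is None:
        alpha = color[3] if len(color) > 3 else 1.0
    return f'rgba({r}, {g}, {b}, {alpha})'


print(to_css_rgba((0.0, 0.5, 1.0, 1.0), alpha=0.5))
```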
It's easiest to get used to these kinds of figures by focusing on a single event. Take Indian Wells as an example. The vast majority of top 20 players who competed at Indian Wells came from Acapulco and Dubai. There are also a few players who came from Delray Beach. These are all hard court events. However, a few players went to South America after the Australian Open to compete on clay. Looking at the South American swing, most players who went played multiple tournaments while there. Few players followed the South American clay with a hard court warm-up event before Indian Wells.
Because I'm looking at a high level population, almost all of the players that played at Indian Wells went immediately to Miami. (These Masters tournaments are so close on the calendar that a player that wins both in the same season is often said to have completed the Sunshine Double).
From there, most players begin the clay court season, though one player of note skipped the 2017 clay court season and proceeded directly to the grass of Stuttgart. That player's name is Roger Federer.
player_schedules.loc[(
player_schedules.tourney_name == 'Miami Masters')
& (player_schedules.next_tourney_name == 'Stuttgart')
& player_schedules.player_name.isin(player_list), :]
It's also interesting to look at how players arrive at the grey End of Season node. Some top players ended their seasons at Wimbledon that year, though most finished at the Paris Masters or the ATP World Tour Finals (London).
player_schedules.loc[(
player_schedules.tourney_name == 'Wimbledon')
& (player_schedules.next_tourney_name == 'End of Season')
& player_schedules.player_name.isin(player_list), :]
While the visualization for players in the top 20 was relatively easy to digest, I'm more interested in a broader look at the sport. I've shown below the same figure, but this time for players that were in the top 100 for at least one week in 2017. I haven't included any events below ATP 250s, so players who appear to skip large parts of the season or end their season early may actually have been playing lower-level events.
player_list = peak_ranks[peak_ranks <= 100].index
selection = player_schedules.loc[player_schedules.player_name.isin(player_list), :]
tournaments, transitions, transition_ids = construct_transitions(selection)
create_tournaments_sankey(tournaments,
transition_ids,
title='2017 ATP Player Travel, Peak Rank in Top 100')
The US Open Series is often advertised as The Road to the US Open, but for some players that road is paved with grass or clay.
World Maps
One of the most interesting features of the above diagrams is the South American Clay Court Season: Quito, Buenos Aires, Rio de Janeiro, Sao Paulo. Some players, presumably those with better results on clay, head to South America to play on clay, then come to North America for two Masters events on hard courts, and then return to clay in Europe (or Houston). Some skip Indian Wells and Miami entirely (those are likely the lower ranked players).
Geography clearly plays a role in the attendance at these tournaments. If you're already seeking out clay before the (first) hardcourt season ends, and there are back to back tournaments in the area, why not attend all of them?
To get a look at the geography of the ATP world tour, I've plotted the same data on a world map. Colors have the same meaning as before, and lines are thicker when more players take the same path.
World map plotting and boundaries are made possible with the help of the Met Office's Python module cartopy.
def plot_tour_travel(
        transitions, extent=False,
        title='Player Travel in the 2017 ATP Tour'):
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(1, 1, 1, projection=ccrs.Robinson())
    greys_cmap = plt.get_cmap('Greys')
    if extent:
        ax.set_extent(extent, crs=ccrs.Geodetic())
    else:
        ax.set_global()
    ax.add_feature(cartopy.feature.OCEAN, zorder=0, facecolor=greys_cmap(0.98))
    ax.add_feature(cartopy.feature.LAND, zorder=0, facecolor=greys_cmap(0.95))
    ax.coastlines()
    # one plot call per transition so each path gets its own width and color
    for i in range(len(transitions)):
        plt.plot([transitions.lng[i], transitions.next_lng[i]],
                 [transitions.lat[i], transitions.next_lat[i]],
                 linewidth=np.log2(transitions.num_players[i]) / 5,
                 color=transitions.origin_color[i],
                 transform=ccrs.Geodetic())
        plt.scatter(transitions.lng[i], transitions.lat[i],
                    color=transitions.origin_color[i], zorder=1000,
                    transform=ccrs.Geodetic())
    plt.title(title, fontsize=20)
    plt.show()
plot_tour_travel(transitions)
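The log scaling in `plot_tour_travel` compresses the range of line widths, with one side effect: a path taken by a single player gets a width of log2(1) / 5 = 0, which matplotlib will typically not render as a visible line. That may well be deliberate, since it declutters the map. If single-player paths should stay visible, one option is to add an offset before taking the log. The `path_linewidth` helper and its `offset` parameter are hypothetical, not part of the function above.

```python
import math

def path_linewidth(num_players, offset=0):
    """Line width for a path; offset=0 mirrors log2(n) / 5 from above."""
    return math.log2(num_players + offset) / 5

# Compare the original scaling with an offset of 1 for a few group sizes.
for n in (1, 2, 8, 32):
    print(n, path_linewidth(n), path_linewidth(n, offset=1))
```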
I should note that while this looks like a flight map, players may not actually be taking these paths. Players likely return home, or to wherever they do the bulk of their training, during the longer gaps in the season. Players may also travel for Davis Cup events, which do not tend to follow the general geographic meandering of the ATP tour.
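Since the paths are drawn with the Geodetic transform, each line follows a great circle. To put a rough number on such a path, the haversine formula gives the great-circle distance between two coordinates. This is a minimal sketch assuming a spherical Earth with a mean radius of 6371 km; the function name and the approximate Melbourne and Paris coordinates are mine, not from the location data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates for Melbourne and Paris, for illustration only.
print(round(haversine_km(-37.81, 144.96, 48.86, 2.35)))
```

Applied to the `lat`, `lng`, `next_lat`, and `next_lng` columns, a function like this could estimate the distance of each transition, with the caveat above that players may not travel directly between events.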
Below is the same image, this time zoomed in on Europe. There's so much back and forth in European tournaments that it's hard to follow the sequences of events with this figure, but it still gives an idea of where the events are. Keen observers will note that some events seem to be missing, but what is actually happening is that multiple events held in the same city are stacked on top of each other (Roland Garros and the Paris Masters in Paris; Queen's Club, Wimbledon, and the ATP World Tour Finals in London).
plot_tour_travel(transitions, extent = [-20, 45, 30, 60],
title='Player Travel in the 2017 ATP Tour, Europe')
Part of what makes this so messy is that, unlike in South America, high-level play takes place in Europe in three different parts of the season: on hard courts after the Australian Open, on clay and grass after Miami, and on hard courts after the US Open and the Asian swing. To get a sense of the temporal back and forth, I've shown below the events played by Alexander Zverev. The color of each path between tournaments gets "hotter" as the season progresses.
def plot_player_travel(df, player):
    selection = df.loc[df.player_name == player, :].reset_index(drop=True)
    greys_cmap = plt.get_cmap('Greys')
    plasma_cmap = plt.get_cmap('plasma')
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(1, 1, 1, projection=ccrs.Robinson())
    ax.set_global()
    ax.add_feature(cartopy.feature.OCEAN, zorder=0, facecolor=greys_cmap(0.98))
    ax.add_feature(cartopy.feature.LAND, zorder=0, facecolor=greys_cmap(0.85))
    ax.coastlines()
    # ordinarily looping through a dataframe is bad practice
    # vectorised operations should be used instead
    # however, plotting in different colors requires different calls to plt.plot
    for i in range(len(selection)):
        plt.plot([selection.lng[i], selection.next_lng[i]],
                 [selection.lat[i], selection.next_lat[i]],
                 linewidth=2,
                 color=plasma_cmap(i / len(selection)),
                 transform=ccrs.Geodetic())
        plt.scatter(selection.lng[i], selection.lat[i],
                    color=plasma_cmap(i / len(selection)), zorder=1000,
                    transform=ccrs.Geodetic())
    plt.title('2017 Tournament Schedule for ' + player, fontsize=20)
    plt.show()
plot_player_travel(player_schedules, 'Alexander Zverev')
Zverev, a European himself, had a particularly European schedule: the Australian Open, hard courts in Europe, two hard court Masters in the United States, the clay and grass season in Europe, the US Open Series in the United States, three hard court events in Asia, and more hard courts in Europe.
This is actually pretty common. A sport that is often dominated by Europeans has a heavily European tour schedule. But not all players have that schedule. Clay court specialists, like Dominic Thiem, tend to head to South America for at least some of the clay tournaments there.
plot_player_travel(player_schedules, 'Dominic Thiem')
Americans like John Isner often play a majority of their events in the United States, and spend less time in Europe for the clay court season.
plot_player_travel(player_schedules, 'John Isner')
Without taking a close look, I mostly assumed that the top players all played pretty much the same schedule, and most of the variation came from lower level players. While that's partially true, I think it would be more accurate to say that most of the schedule variation comes with lower level events. The Slams and Masters events actually give a lot of cohesion to the schedule, and players fill in the rest of their schedule according to their preferences in geography, surfaces, and appearance fees.